A. The Black Box Problem

You cannot ship what you cannot trace

Agenda

  • A. The Black Box Problem — Why agent debugging is different
  • B. Structured Tracing — Seeing inside the agent’s brain
  • C. Loop Detection — Catching agents that spin in circles
  • D. Cost Tracking — Monitoring spend per query
  • E. LLM-as-Judge Evaluation — Measuring agent quality scientifically
  • F. Wrap-up — Key takeaways & lab preview

Traditional vs Agent Debugging

Traditional Debugging              Agent Debugging
----------------------------------------------------------------------
Same input -> same output          Same input -> different outputs
Stack trace shows exact failure    Failure emerges over multiple steps
Unit tests with assertEqual        Subjective quality evaluation
Fixed cost per execution           Variable cost (1-50 LLM calls)
Errors crash the program           Errors may be silently “reasoned away”

The Core Problem

An agent might technically succeed (no crash, produces an answer) while being completely wrong. Or it might spend $0.50 on a $0.02 question. You can’t fix what you can’t see.

Five Ways Agents Fail

  1. Prompt ambiguity — the agent didn’t understand the task
  2. Tool misuse — right tool, wrong arguments
  3. Formatting errors — produced malformed JSON instead of a valid tool call
  4. Infinite loops — kept searching “Python” 50 times
  5. Hallucination — confidently lied about a search result it never saw

Production Tip: Never deploy an agent without tracing. Costs can explode if an agent enters an infinite loop.

B. Structured Tracing

Every step, captured and queryable

What a Good Trace Captures

For every step in the agent loop:

  • Trace ID (unique per request)
  • Step number
  • Agent’s reasoning (LLM content)
  • Tool calls (name + arguments)
  • Tool results
  • Token usage (input/output)
  • Cost per step (USD)
  • Duration (milliseconds)
Example of one traced step:

[Step 2] (980ms, $0.0062)
  Reasoning: "The search returned results about Paris. Let me..."
  Tool: search({"query": "population of Paris 2024"})
  Result: "The population of Paris is approximately 2.1 million..."

The Trace Data Model

from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallRecord:
    tool_name: str
    tool_input: dict
    tool_output: str
    duration_ms: float

@dataclass
class AgentStep:
    step_number: int
    reasoning: Optional[str]
    tool_calls: list[ToolCallRecord]
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    trace_id: str
    agent_name: str
    steps: list[AgentStep]
    status: str  # "running", "completed", "failed", "loop_detected"
    total_cost_usd: float = 0.0
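To make the model concrete, here is a runnable sketch that rebuilds Step 2 of the example trace. The dataclasses are repeated (with defaults added for `steps` and `status` so the snippet stands alone); the token counts are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCallRecord:
    tool_name: str
    tool_input: dict
    tool_output: str
    duration_ms: float

@dataclass
class AgentStep:
    step_number: int
    reasoning: Optional[str]
    tool_calls: list[ToolCallRecord]
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    trace_id: str
    agent_name: str
    steps: list[AgentStep] = field(default_factory=list)
    status: str = "running"
    total_cost_usd: float = 0.0

# Rebuild Step 2 of the example trace
call = ToolCallRecord(
    tool_name="search",
    tool_input={"query": "population of Paris 2024"},
    tool_output="The population of Paris is approximately 2.1 million...",
    duration_ms=980.0,
)
step = AgentStep(
    step_number=2,
    reasoning="The search returned results about Paris. Let me...",
    tool_calls=[call],
    cost_usd=0.0062,
)
trace = Trace(trace_id="a1b2c3d4", agent_name="react_agent")
trace.steps.append(step)
trace.total_cost_usd = sum(s.cost_usd for s in trace.steps)
print(f"{trace.trace_id}: {len(trace.steps)} step(s), ${trace.total_cost_usd:.4f}")
```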

The AgentTracer Class

class AgentTracer:
    def start_trace(self, agent_name, query, model) -> str:
        """Start a new trace. Returns trace_id."""

    def log_step(self, trace_id, step: AgentStep):
        """Log a completed step — accumulates tokens and cost."""

    def end_trace(self, trace_id, output, status="completed"):
        """Mark trace as complete."""

    def get_trace_json(self, trace_id) -> str:
        """Export trace as JSON for debugging."""

    def print_summary(self, trace_id):
        """Human-readable trace summary."""

In production, this would send data to Datadog, LangSmith, or Arize. For this course, we log to console and export to JSON.
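A minimal in-memory implementation of that interface might look like this. The internal `_traces` dict and the simplified `AgentStep` (no tool calls) are our sketch, not the course code:

```python
import json
import uuid
from dataclasses import dataclass, asdict

@dataclass
class AgentStep:
    step_number: int
    reasoning: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

class AgentTracer:
    def __init__(self):
        self._traces: dict[str, dict] = {}

    def start_trace(self, agent_name, query, model) -> str:
        """Start a new trace. Returns trace_id."""
        trace_id = uuid.uuid4().hex[:8]
        self._traces[trace_id] = {
            "trace_id": trace_id, "agent_name": agent_name, "model": model,
            "query": query, "status": "running", "steps": [],
            "total_cost_usd": 0.0, "output": None,
        }
        return trace_id

    def log_step(self, trace_id, step: AgentStep):
        """Log a completed step -- accumulates cost."""
        trace = self._traces[trace_id]
        trace["steps"].append(asdict(step))
        trace["total_cost_usd"] += step.cost_usd

    def end_trace(self, trace_id, output, status="completed"):
        """Mark trace as complete."""
        self._traces[trace_id].update(output=output, status=status)

    def get_trace_json(self, trace_id) -> str:
        """Export trace as JSON for debugging."""
        return json.dumps(self._traces[trace_id], indent=2)

tracer = AgentTracer()
tid = tracer.start_trace("react_agent", "What is the capital of France?", "gpt-4o")
tracer.log_step(tid, AgentStep(step_number=1, reasoning="Searching...", cost_usd=0.0085))
tracer.end_trace(tid, "Paris")
```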

Trace Output Example

============================================================
TRACE SUMMARY: a1b2c3d4
============================================================
Agent: react_agent | Model: gpt-4o
Status: completed
Query: What is the population of the capital of France?

Steps (3 total):
------------------------------------------------------------
  Step 1: 1200ms, $0.0085 -> search
  Step 2: 980ms,  $0.0062 -> search
  Step 3: 450ms,  $0.0031 (no tools)
------------------------------------------------------------
Total Tokens: 2847 input + 312 output = 3159
Total Cost:   $0.0178
Total Time:   2630ms

Answer: The population of Paris is approximately 2.1 million.
============================================================

C. Loop Detection

Catching agents that spin in circles

The Infinite Loop Problem

An agent calls search("python tutorial"), gets a result, then calls search("python tutorial") again. And again. And again.

Why it happens:

  • The model doesn’t “understand” the result satisfied the query
  • The tool returns an error, and the agent retries with the same arguments
  • The prompt is ambiguous, so the agent keeps trying variations

Cost Impact

A looping agent can burn through hundreds of API calls at $0.01-0.05 each. A single bad query can cost $5+ before max_steps kicks in.

Three Detection Strategies

graph LR
    TC["Tool Call"] --> E["Exact Match<br/>Same tool + same args"]
    TC --> F["Fuzzy Match<br/>Similar args (Jaccard)"]
    TC --> S["Output Stagnation<br/>Similar outputs repeated"]

    style E fill:#FF7A5C,stroke:#1C355E,color:#1C355E
    style F fill:#9B8EC0,stroke:#1C355E,color:#1C355E
    style S fill:#00C9A7,stroke:#1C355E,color:#1C355E

Strategy 1: Exact Match

Same tool + identical arguments repeated N times.

# Track history of (tool_name, arguments_string) pairs
current = (tool_name, tool_input_str)
exact_count = sum(
    1 for past_tool, past_input in self.tool_history
    if (past_tool, past_input) == current
)

if exact_count >= self.exact_threshold:  # Default: 2
    return LoopDetectionResult(
        is_looping=True,
        strategy="exact",
        confidence=1.0,
        message="Exact loop detected! Change your approach."
    )

Confidence: 100% — this is always a loop.
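Wrapped in a self-contained class, the exact-match strategy might look like this. `ExactLoopDetector` and its simplified `LoopDetectionResult` are illustrative names, expanded from the fragment above:

```python
from dataclasses import dataclass

@dataclass
class LoopDetectionResult:
    is_looping: bool
    strategy: str = ""
    confidence: float = 0.0
    message: str = ""

class ExactLoopDetector:
    """Flags a loop once the same (tool, args) pair has already been
    seen exact_threshold times before the current call."""

    def __init__(self, exact_threshold: int = 2):
        self.exact_threshold = exact_threshold
        self.tool_history: list[tuple[str, str]] = []

    def check_tool_call(self, tool_name: str, tool_input: str) -> LoopDetectionResult:
        current = (tool_name, tool_input)
        exact_count = sum(1 for past in self.tool_history if past == current)
        self.tool_history.append(current)
        if exact_count >= self.exact_threshold:
            return LoopDetectionResult(
                True, "exact", 1.0, "Exact loop detected! Change your approach.")
        return LoopDetectionResult(False)

detector = ExactLoopDetector()
# The third identical call has been seen twice before -> flagged
results = [detector.check_tool_call("search", '{"query": "python tutorial"}')
           for _ in range(3)]
```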

Strategy 2: Fuzzy Match

Similar (but not identical) tool calls — catches minor rephrasing.

def _jaccard_similarity(self, s1: str, s2: str) -> float:
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    intersection = tokens1 & tokens2
    union = tokens1 | tokens2
    # Guard against two empty strings (empty union)
    return len(intersection) / len(union) if union else 0.0

  • search("python tutorial basics") vs search("basics python tutorial") -> Jaccard: 1.0
  • search("python tutorial") vs search("python guide") -> Jaccard: 0.33
  • Threshold: 0.8 (catches rephrasings, ignores truly different queries)
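A quick standalone check of those similarity numbers (module-level `jaccard_similarity` is the same logic as the method above):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    union = tokens1 | tokens2
    return len(tokens1 & tokens2) / len(union) if union else 0.0

# Word order doesn't matter: identical token sets score 1.0
same = jaccard_similarity("python tutorial basics", "basics python tutorial")
# One shared token ("python") out of three distinct tokens: 1/3
diff = jaccard_similarity("python tutorial", "python guide")
```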

Strategy 3: Output Stagnation

The agent keeps producing similar responses — not making progress.

# Check last N outputs for pairwise similarity
recent = self.output_history[-stagnation_window:]
if len(recent) < 2:
    return LoopDetectionResult(is_looping=False)  # nothing to compare yet

similarities = []
for i in range(len(recent)):
    for j in range(i + 1, len(recent)):
        sim = self._jaccard_similarity(recent[i], recent[j])
        similarities.append(sim)

avg_similarity = sum(similarities) / len(similarities)
if avg_similarity >= 0.8:
    return LoopDetectionResult(is_looping=True, strategy="stagnation")
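The stagnation check can also be packaged as a standalone function for experimentation. The names `is_stagnating` and `window` are ours; the Jaccard logic and 0.8 threshold match the snippets above:

```python
from itertools import combinations

def jaccard(s1: str, s2: str) -> float:
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

def is_stagnating(output_history, window=3, threshold=0.8):
    """True when the recent outputs are, on average, near-duplicates."""
    recent = output_history[-window:]
    if len(recent) < 2:
        return False  # nothing to compare yet
    sims = [jaccard(a, b) for a, b in combinations(recent, 2)]
    return sum(sims) / len(sims) >= threshold

# Three identical reasoning outputs -> average similarity 1.0 -> stagnating
stuck = is_stagnating(["I will search for the answer now"] * 3)
```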

The Circuit Breaker Pattern

Combine loop detection with the agent loop to break infinite cycles:

for step in range(max_steps):
    # ... get LLM response, extract tool calls ...

    for tool_call in tool_calls:
        # Check BEFORE executing
        loop_check = loop_detector.check_tool_call(
            tool_call.name, str(tool_call.arguments)
        )

        if loop_check.is_looping:
            # Inject warning into conversation instead of executing
            messages.append({
                "role": "tool",
                "content": f"LOOP DETECTED: {loop_check.message}"
            })
            break  # Skip remaining tool calls this step

The agent receives the loop warning as if it were a tool result — it can then change strategy.
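To see the pattern end to end, here is a toy sketch: a stub "agent" that keeps issuing the same tool call, plus a minimal exact-match detector standing in for the real one. All names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class LoopCheck:
    is_looping: bool
    message: str = ""

class Detector:
    """Minimal exact-match detector, enough to demo the circuit breaker."""

    def __init__(self, threshold: int = 2):
        self.history: list[tuple[str, str]] = []
        self.threshold = threshold

    def check_tool_call(self, name: str, args: str) -> LoopCheck:
        current = (name, args)
        count = self.history.count(current)
        self.history.append(current)
        if count >= self.threshold:
            return LoopCheck(True, "Stop repeating this call; try a new approach.")
        return LoopCheck(False)

def run_fake_agent(max_steps: int = 5):
    """Stub agent that always wants the same search; the breaker interrupts it."""
    detector, messages = Detector(), []
    for _step in range(max_steps):
        name, args = "search", '{"query": "python tutorial"}'  # stuck agent
        check = detector.check_tool_call(name, args)  # check BEFORE executing
        if check.is_looping:
            # Inject the warning as if it were a tool result, then stop
            messages.append({"role": "tool",
                             "content": f"LOOP DETECTED: {check.message}"})
            break
        messages.append({"role": "tool", "content": "result for python tutorial"})
    return messages

msgs = run_fake_agent()
```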

D. Cost Tracking

Monitoring spend per query

Why Track Cost?

Without tracking:

  • “Our AI bill was $500 this month”
  • “Some queries are expensive but we don’t know which”
  • No budget enforcement

With tracking:

  • “Query X cost $2.30 (15 steps)”
  • “Average cost: $0.12/query”
  • Budget alerts per query

Cost Tracking with LiteLLM

LiteLLM provides built-in cost calculation:

from litellm import completion, completion_cost

response = completion(model="gpt-4o", messages=messages)

# Get cost from the response
cost = completion_cost(completion_response=response)
print(f"This step cost: ${cost:.4f}")

Integrate with the tracer by recording the cost on the step before logging it:

step_record.cost_usd = cost
tracer.log_step(trace_id, step_record)

# At the end of the trace:
# Total Cost: $0.0178  (sum of all steps)

Setting Budget Limits

class BudgetExceededError(Exception):
    """Raised when a query exceeds its cost budget."""

class CostTracker:
    def __init__(self, budget_limit_usd: float = 1.0):
        self.budget_limit = budget_limit_usd
        self.total_spent = 0.0

    def add_cost(self, cost: float) -> bool:
        self.total_spent += cost
        if self.total_spent > self.budget_limit:
            raise BudgetExceededError(
                f"Query cost ${self.total_spent:.2f} "
                f"exceeds budget of ${self.budget_limit:.2f}"
            )
        return True

Production Tip: Set per-query budgets ($1-5) AND daily budgets ($50-500). A single runaway agent should never bankrupt your project.
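One way to layer the two budgets, restating the classes above so the snippet runs standalone (`run_query_with_budget` and the limits are illustrative):

```python
class BudgetExceededError(Exception):
    """Raised when a budget limit is exceeded."""

class CostTracker:
    def __init__(self, budget_limit_usd: float = 1.0):
        self.budget_limit = budget_limit_usd
        self.total_spent = 0.0

    def add_cost(self, cost: float) -> bool:
        self.total_spent += cost
        if self.total_spent > self.budget_limit:
            raise BudgetExceededError(
                f"Cost ${self.total_spent:.2f} "
                f"exceeds budget of ${self.budget_limit:.2f}")
        return True

# One tracker per query, one shared tracker for the whole day
daily = CostTracker(budget_limit_usd=50.0)

def run_query_with_budget(step_costs, per_query_limit=1.0):
    query_tracker = CostTracker(per_query_limit)
    for cost in step_costs:
        query_tracker.add_cost(cost)  # raises if this query runs away
        daily.add_cost(cost)          # raises if the whole day runs away
    return query_tracker.total_spent

# The three step costs from the trace summary earlier
spent = run_query_with_budget([0.0085, 0.0062, 0.0031])
```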

E. LLM-as-Judge Evaluation

Measuring agent quality scientifically

The Evaluation Problem

How do you know if your agent is any good?

  • Manual review doesn’t scale
  • String matching can’t evaluate free-text answers
  • Unit tests check format, not quality

The Solution

Use another LLM as a judge to evaluate the agent’s outputs against reference answers on structured criteria.

Evaluation Criteria

Criterion      Scale  What It Measures
-------------  -----  ------------------------------------------
Accuracy       1-5    Are the facts correct?
Completeness   1-5    Does it address all parts of the question?
Hallucination  1-5    Does it invent information? (5 = none)
Conciseness    1-5    Is it the right length?

Overall Score = weighted average:

0.35 * Accuracy + 0.25 * Completeness + 0.25 * Hallucination + 0.15 * Conciseness
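That weighted average in code, checked against the "Population of Tokyo" row of the dashboard (the `WEIGHTS` dict name is ours):

```python
WEIGHTS = {"accuracy": 0.35, "completeness": 0.25,
           "hallucination": 0.25, "conciseness": 0.15}

def overall_score(scores: dict) -> float:
    """Weighted average of the four 1-5 criterion scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Acc 5, Comp 4, Hall 5, Conc 4 -> 1.75 + 1.00 + 1.25 + 0.60 = 4.60
score = overall_score({"accuracy": 5, "completeness": 4,
                       "hallucination": 5, "conciseness": 4})
```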

The Evaluation Pipeline

import json
from litellm import completion

class AgentEvaluator:
    def evaluate_one(self, question, agent_answer, reference_answer):
        response = completion(
            model=self.judge_model,
            messages=[
                {"role": "system", "content": EVALUATION_PROMPT},
                {"role": "user", "content": f"""
                    QUESTION: {question}
                    REFERENCE ANSWER: {reference_answer}
                    AGENT'S ANSWER: {agent_answer}
                """}
            ],
            response_format={"type": "json_object"},
            temperature=0,  # Deterministic evaluation
        )
        scores = json.loads(response.choices[0].message.content)
        return EvaluationResult(**scores)

Evaluation Results Dashboard

======================================================================
EVALUATION RESULTS (5 queries)
======================================================================
Question                             Acc  Comp  Hall  Conc  Overall  Pass
----------------------------------------------------------------------
What is the capital of France?         5     5     5     5     5.00  PASS
Compare Python and JavaScript          4     4     4     3     3.85  PASS
Latest AI research trends              3     3     4     3     3.25  FAIL
Population of Tokyo metropolitan..     5     4     5     4     4.60  PASS
Explain quantum computing simply       4     3     5     4     4.00  PASS
----------------------------------------------------------------------
AVERAGE                              4.2   3.8   4.6   3.8    4.14

Pass Rate: 80% (4/5)
======================================================================

Building an Eval Dataset

[
  {
    "question": "What is the population of Tokyo?",
    "reference": "The population of Tokyo is approximately 14 million...",
    "category": "factual",
    "difficulty": "easy"
  },
  {
    "question": "Compare renewable energy policies of EU and US",
    "reference": "The EU has committed to... while the US...",
    "category": "comparison",
    "difficulty": "hard"
  }
]
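Loading and slicing such a dataset before an eval run might look like this. The JSON is the two examples above, inlined so the snippet is self-contained; in practice you would read it from a file:

```python
import json

# The two examples above, inlined for a self-contained demo
dataset_json = """
[
  {"question": "What is the population of Tokyo?",
   "reference": "The population of Tokyo is approximately 14 million...",
   "category": "factual", "difficulty": "easy"},
  {"question": "Compare renewable energy policies of EU and US",
   "reference": "The EU has committed to... while the US...",
   "category": "comparison", "difficulty": "hard"}
]
"""

examples = json.loads(dataset_json)
# Slice by difficulty, e.g. to eval the hard cases separately
hard = [e for e in examples if e["difficulty"] == "hard"]
```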

Best Practice

Maintain a living eval dataset (20-50 examples). Run evals before every agent change. “It feels better” becomes “accuracy improved by 14%.”

F. Wrap-up

Key Takeaways

  1. Tracing is non-negotiable — every step, every tool call, every cost
  2. Loop detection uses three strategies: exact, fuzzy, and stagnation
  3. Circuit breakers inject loop warnings into the conversation
  4. Cost tracking prevents runaway spending with per-query budgets
  5. LLM-as-Judge turns “it feels better” into measurable metrics

Lab Preview: The Broken Agent

Step 1: Instrumentation

  • Inject AgentTracer into ReactAgent
  • Run a query that triggers a loop

Step 2: Diagnosis

  • Read the trace JSON/logs
  • Identify the repeating tool calls

Step 3: The Fix

  • Implement a circuit breaker
  • Add loop detection to the agent loop
  • Verify with the evaluation suite

Time: 75 minutes

Questions?

Session 3 Complete